Abstract

We consider the clustering with diversity problem: given a set of colored points
in a metric space, partition them into clusters such that each cluster has at least ℓ
points, all of which have distinct colors. We give a 2-approximation to this problem
for any ℓ when the objective is to minimize the maximum radius of any cluster.
We show that the approximation ratio is optimal unless P = NP, by providing a
matching lower bound. Several extensions to our algorithm have also been developed
for handling outliers. This problem is mainly motivated by applications in
privacy-preserving data publication.

Keywords: Approximation algorithm, k-center, k-anonymity, ℓ-diversity

1 Introduction 

Clustering is a fundamental problem with a long history and a rich collection of results.
A general clustering problem can be formulated as follows. Given a set of points P in 
a metric space, partition P into a set of disjoint clusters such that a certain objective 
function is minimized, subject to some cluster-level and/or instance-level constraints. 
Typically, cluster-level constraints impose restrictions on the number of clusters or on 
the size of each cluster. The former corresponds to the classical k-center, k-median,
k-means problems, while the latter has recently received much attention from various
research communities [17, 1, 16]. On the other hand, instance-level constraints specify
whether particular items are similar or dissimilar, usually based on some background 
knowledge [26 3|. In this paper, we impose a natural instance-level constraint on a 
clustering problem, that the points are colored and all points partitioned into one cluster 
must have distinct colors. We call such a problem clustering with diversity. Note that 
the traditional clustering problem is a special case of ours where all points have unique 
colors. 

As an illustrating example, consider the problem of choosing locations for a number 
of factories in an area where different resources are scattered. Each factory needs at 
least ℓ different resources allocated to it and the resource in one location can be sent to
only one factory. This problem corresponds to our clustering problem where each kind
of resource has a distinct color, and we have a lower bound ℓ on the cluster size.
The main motivation to study clustering with diversity is privacy preservation for
data publication, which has drawn tremendous attention in recent years in both the 
database community flU El [29] HQ |28] El E3 and the theory community fl] |22l |2] QU 
fTTI . The goal of all the studies in privacy preservation is to prevent linking attacks 
[25]. Consider the table of patient records in Figure 1(a), usually called the microdata.
There are three types of attributes in a microdata table. The sensitive attribute (SA), 
such as "Disease", is regarded as the individuals' privacy, and is the target of protec- 
tion. The identifier, in this case "Name", uniquely identifies a record, hence must be 
ripped off before publishing the data. The rest of the attributes, such as "Age", "Gen- 
der", and "Education", should be published so that researchers can apply data mining 
techniques to study the correlation between these attributes and "Disease". However, 
since these attributes are public knowledge, they can often uniquely identify individu- 
als when combined together. For example, if an attacker knows (i) the age (25), gender 
(M), and education level (Master) of Bob, and (ii) Bob has a record in the microdata, 
s/he easily finds out that Tuple 2 is Bob's record and hence, Bob contracted HIV. There- 
fore, these attributes are often referred to as the quasi-identifiers (QI). The solution is 
thus to make these QIs ambiguous before publishing the data so that it is difficult for an 
attacker to link an individual from the QIs to his/her SA, but at the same time we want 
to minimize the amount of information loss due to the ambiguity introduced to the QIs 
so that the interesting correlations between the QIs and the SA are still preserved. 

The usual approach taken to prevent linking attacks is to partition the tuples into
a number of QI-groups, namely clusters, and within each cluster all the tuples share
the same (ambiguous) QIs. There are various ways to introduce ambiguity. A popular
approach, as taken by [1], is to treat each tuple as a high-dimensional point in the
QI-space, and then only publish the center, the radius, and the number of points of each
cluster. To ensure a certain level of privacy, each cluster is required to have at least k
points so that the attacker is not able to correctly identify an individual with confidence
larger than 1/k. This requirement is referred to as the k-ANONYMITY principle [11, 22].
The problem, translated to a clustering problem, can be phrased as follows: Cluster
a set of points in a metric space, such that each cluster has at least r points. When
the objective is to minimize the maximum radius of all clusters, the problem is called
r-GATHERING and a 2-approximation is known [1].

However, the k-ANONYMITY principle suffers from the homogeneity problem: A
cluster may have too many tuples with the same SA value. For example, in Figure 1(b),
all tuples in QI-groups 1, 3, and 4 respectively have the same disease. Thus, the
attacker can infer what disease all the people within a QI-group have contracted without
identifying any individual record. The above problem has led to the development of
many SA-aware principles. Among them, ℓ-DIVERSITY lETl is the most widely
deployed lfl4l |2T1 [THl |29l l28l due to its simplicity and good privacy guarantee. The
principle demands that, in each cluster, at most 1/ℓ of its tuples can have the same SA
value. Figure 1(c) shows a 2-diverse version of the microdata. In an ℓ-diverse table,
an attacker can figure out the real SA value of the individual with confidence no more
than 1/ℓ. Treating the SA values as colors, this problem then exactly corresponds to
the clustering with diversity problem defined at the beginning, where we have a lower
bound ℓ on the cluster size.

In contrast to the many theoretical results for r-GATHERING and k-ANONYMITY
[1, 2, 22], no approximation algorithm with performance guarantees is known for
ℓ-DIVERSITY, even though many heuristic solutions have been proposed [20, 14, 21].

Figure 1: (a) The microdata; (b) A 2-anonymous table; (c) A 2-diverse table.

Clustering with instance-level constraints and other related work. Clustering with
instance-level constraints is a developing area that has begun to find many interesting
applications in various areas such as bioinformatics [5], machine learning [26, 27], data
cleaning [4], etc. Wagstaff and Cardie in their seminal work [26] considered the following
two types of instance-level hard constraints: A must-link (ML) constraint dictates
that two particular points must be clustered together and a cannot-link (CL) constraint
requires that they be separated. Many heuristics and variants have been developed
subsequently, e.g. [27, 31], and some hardness results with respect to minimizing the
number of clusters were also obtained [10]. However, to the best of our knowledge,
no approximation algorithm with performance guarantee is known for any version of
the problem. We note that an ℓ-diverse clustering can be seen as a special case where
nodes with the same color must satisfy CL constraints.

As opposed to the hard constraints imposed on any clustering, the correlation
clustering problem [6] considers soft and possibly conflicting constraints and aims at
minimizing the violation of the given constraints. An instance of this problem can be
represented by a complete graph with each edge labeled (+) or (-) for each pair of vertices,
indicating that two vertices should be in the same or different clusters, respectively.
The goal is to cluster the elements so as to minimize the number of disagreements, i.e.,
(-) edges within clusters and (+) edges crossing clusters. The best known approximations
for various versions of the problem are due to Ailon et al. [3]. If the number of
clusters is stipulated to be a small constant k, there is a polynomial time approximation
scheme lfT31l. In the Dedupalog project, Arasu et al. [4] considered correlation clustering
together with instance-level hard constraints, with the aim of de-duplicating entity
references.

Approximation algorithms for clustering with outliers were first considered by
Charikar et al. [9]. The best known approximation factor for r-GATHERING with
outliers is 4 due to Aggarwal et al. [1].



Our results. In this paper, we give the first approximation algorithms for the
clustering with diversity problem. We formally define the problem as follows.

Definition 1 (ℓ-DIVERSITY) Given a set of n points in a metric space where each of
them has a color, cluster them into a set C of clusters, such that each cluster has at least
ℓ points, and all of its points have distinct colors. The goal is to minimize the maximum
radius of any cluster.
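
To make Definition 1 concrete, the following Python sketch checks whether a proposed clustering is feasible and, if so, reports its objective value. It is only an illustration under stated assumptions: points are coordinate tuples, the metric is Euclidean (via math.dist), and the function name, the argument names, and the dictionary layout of clusters are hypothetical choices that do not come from the paper.

```python
from math import dist  # Euclidean distance, standing in for the metric


def max_radius_if_valid(points, colors, clusters, ell):
    """Return the largest cluster radius, or None if the clustering is infeasible.

    `clusters` maps a center index to the indices of the points in its
    cluster (the center included). All names here are illustrative.
    """
    assigned = [i for members in clusters.values() for i in members]
    # Every point must belong to exactly one cluster.
    if sorted(assigned) != list(range(len(points))):
        return None
    worst = 0.0
    for center, members in clusters.items():
        member_colors = {colors[i] for i in members}
        # At least ell points, all with distinct colors, center inside.
        if len(members) < ell or len(member_colors) != len(members):
            return None
        if center not in members:
            return None
        worst = max(worst, max(dist(points[center], points[i]) for i in members))
    return worst
```

A clustering that puts two points of the same color together, or leaves a cluster with fewer than ℓ points, is rejected by returning None.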

Our first result (Section 2) is a 2-approximation algorithm for ℓ-DIVERSITY. The
algorithm follows a similar framework as in [1], but it is substantially more complicated.
The difficulty is mainly due to the requirement to resolve the conflicting colors in each
cluster while maintaining its minimum size ℓ. To the best of our knowledge, this is the first
approximation algorithm for a clustering problem with instance-level hard constraints.

Next, we show that this approximation ratio is the best possible by presenting a
matching lower bound (Section 3). A lower bound of 2 is also given in [1] for
r-GATHERING. But to carry that result over to ℓ-DIVERSITY, all the points need to have
unique colors. This severely limits the applicability of this hardness result. In Section 3
we give a construction showing that even with only 3 colors, the problem is NP-hard
to approximate within any factor strictly less than 2. In fact, if there are only 2 colors,
we show that the problem can be solved optimally in polynomial time via bipartite
matching.

Unlike r-GATHERING, an instance of the ℓ-DIVERSITY problem may not have a
feasible solution at all, depending on the color distribution. In particular, we can easily
see that no feasible clustering exists when there is one color that has more than ⌊n/ℓ⌋
points. One way to get around this problem is to have some points not clustered (which
corresponds to deleting a few records in the ℓ-DIVERSITY problem). Deleting records
causes information loss in the published data, hence should be minimized. Ideally, we
would like to delete just enough points such that the remaining points admit a
feasible ℓ-diverse clustering. In Section 4 we consider the ℓ-DIVERSITY-OUTLIERS problem,
where we compute an ℓ-diverse clustering after removing the least possible number of
points. We give an O(1)-approximation algorithm to this problem.

Our techniques for dealing with diversity and cluster size constraints may be useful 
in developing approximation algorithms for clustering with more general instance-level 
constraints. 

2 A 2-Approximation for ℓ-Diversity

In this section we assume that a feasible solution on a given input always exists. We
first introduce some notation. Given a set of n points in a metric space, we construct
a weighted graph G(V, E) where V is the set of points and each vertex v ∈ V has a
color c(v). For each pair of vertices u, v ∈ V with different colors, we have an edge
e = (u, v), and its weight w(e) is just their distance in the metric space. For any
u, v ∈ V, let dist_G(u, v) be the shortest path distance of u, v in graph G. For any set
A ⊆ V, let N_G(A) be the set of neighbors of A in G. For a pair of sets A ⊆ V, B ⊆ V,
let E_G(A; B) = {(a, b) | a ∈ A, b ∈ B, (a, b) ∈ E(G)}. The diameter of a cluster
C of nodes is defined to be d(C) = max_{u,v ∈ C} w(e(u, v)). Given a cluster C and
its center v, the radius r(C) of C is defined as the maximum distance from any node of
C to v, i.e., r(C) = max_{u ∈ C} w(u, v). By the triangle inequality, it is easy to see that
d(C)/2 ≤ r(C) ≤ d(C).

A star forest is a forest where each connected component is a star. A spanning 
star forest is a star forest spanning all vertices. The cost of a spanning forest T is the 
length of the longest edge in T. We call a star forest semi-valid if each star component
contains at least ℓ colors and valid if it is semi-valid and each star is polychromatic, i.e.,
each node in the star has a distinct color. Note that a spanning star forest with cost R 
naturally defines a clustering with largest radius R. Denote the radius and the diameter 
of the optimal clustering by r* and d* , respectively. 

We first briefly review the 2-approximation algorithm for the r-GATHERING
problem [1], which is the special case of our problem when all the points have distinct
colors. Let e_1, e_2, ... be the edges of G in a non-decreasing order of their weights.
The general idea of the r-GATHERING algorithm [1] is to first guess the optimal radius
R by considering each graph G_i formed by the first i edges E_i = {e_1, ..., e_i}, as
i = 1, 2, .... It is easy to see that the cost of a spanning star forest of G_i is at most
w(e_i). For each G_i (1 ≤ i ≤ m), the following condition is tested (rephrased to fit
into our context):

(I) There exists a maximal independent set I such that there is a spanning star forest
in G_i with the nodes in I being the star centers, and each star has at least r nodes.

It is proved [1] that the condition is met if the length of e_i is d*. The condition implies
the radius of our solution is at most d*, which is at most 2r*. Therefore, we get a
2-approximation. In fact, the independent set I can be chosen greedily and finding the
spanning star forest can be done via a network flow computation.

Our 2-approximation for the ℓ-DIVERSITY problem follows the same framework, that
is, we check each G_i in order and test the following condition:

(II) There exists a maximal independent set I such that there is a valid spanning star
forest in G_i with the nodes in I being the star centers.

The additional challenge is of course that, while condition (I) only puts a constraint
on the size of each star, condition (II) requires both the size of each star to be at least ℓ
and all the nodes in a star to have distinct colors. Below we first give a constructive
algorithm that, for a given G_i, tries to find an I such that condition (II) is met. Next we show
that when w(e_i) = d*, the algorithm is guaranteed to succeed. The approximation ratio
of 2 then follows immediately.
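
The thresholding framework relies on the sorted list of edges between differently colored points, whose prefixes define the graphs G_i. The Python sketch below builds that list for points in Euclidean space. It is only an illustration: the function name candidate_edges and the coordinate-tuple input format are assumptions made here, and any metric could replace math.dist.

```python
from itertools import combinations
from math import dist  # Euclidean distance, standing in for the metric


def candidate_edges(points, colors):
    """Return (weight, u, v) triples for all pairs with different colors,
    sorted by non-decreasing weight. The first i triples form E_i."""
    edges = [
        (dist(points[u], points[v]), u, v)
        for u, v in combinations(range(len(points)), 2)
        if colors[u] != colors[v]  # same-colored pairs are never joined
    ]
    edges.sort()
    return edges
```

Each prefix of the returned list corresponds to one threshold graph, so the tests described in this section would be run on successively longer prefixes.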

To find an I to meet condition (II), the algorithm starts with an arbitrary maximal
independent set I, and iteratively augments it until the condition is met, or fails
otherwise. In each iteration, we perform two tests. The first one, denoted by flow test
F-TEST(G_i, I), checks if there exists a semi-valid spanning star forest in G_i with nodes
in I being star centers. If I does not pass this test, the algorithm fails right away.
Otherwise we go on to the second test, denoted by matching test M-TEST(G_i, I), which
tries to find a valid spanning star forest. If this test succeeds, we are done; otherwise
the failure of this test yields a way to augment I and we proceed to the next iteration.
The algorithm is outlined in Algorithm 1.



Algorithm 1: Algorithm to find an I in G_i to meet condition (II)

Let I be an arbitrary maximal independent set in G_i;
while F-TEST(G_i, I) is passed do
    (S, S') ← M-TEST(G_i, I);  /* S ⊆ V, S' ⊆ I */
    if S = ∅ then
        Succeed;
    else
        I ← I − S' + S;
        Add nodes to I until it is a maximal independent set;
Fail;



We now elaborate on F-TEST and M-TEST. F-TEST(G_i, I) checks if there is a spanning
star forest in G_i with I being the star centers such that each star contains at least
ℓ colors. As the name suggests, we conduct the test using a network flow computation.
We first create a source s and a sink t. For each node v ∈ V, we add an edge (s, v)
with capacity 1, and for each node o_j ∈ I (1 ≤ j ≤ |I|), we create a vertex o_j and
add an outgoing edge (o_j, t) with capacity lower bound ℓ. For each node o_j ∈ I and
each color c, we create a vertex p_{j,c} and an edge (p_{j,c}, o_j) with capacity upper bound
1. For any v ∈ V such that (v, o_j) ∈ E_i or v = o_j, and v has color c, we add an
edge from v to p_{j,c} without capacity constraint. Finally, we add one new vertex o'_j for
each o_j ∈ I, connect all v ∈ V to o'_j without capacity constraint if (v, o_j) ∈ E_i or
v = o_j, and connect o'_j to t without capacity constraint. The capacity upper bound of
(p_{j,c}, o_j) forces at most one node with color c to be assigned to o_j. Therefore, all nodes
assigned to o_j have distinct colors. The capacity lower bounds of the edges (o_j, t) require
that each cluster has at least ℓ nodes. The nodes o'_j are used to absorb other unassigned nodes.
It is not difficult to see that there exists a semi-valid spanning star forest with nodes in
I being star centers in G_i if a flow of n units can be found. In this case we say that the
F-TEST is passed. See Figure 2 for an example. Note that a network flow problem with
both capacity lower bounds and upper bounds is usually referred to as the circulation
problem, and is polynomially solvable [19].

Figure 2: The flow network construction. On the left is the original graph, I =
{v_2, v_4}, ℓ = 2. On the right is the corresponding flow network. Thick edges denote a
feasible flow of value |I|ℓ = 4.

Once G_i and I pass F-TEST, we try to redistribute those vertices that cause color
conflicts. We do so by a bipartite matching test M-TEST(G_i, I) which returns two
vertex sets S and S' that are useful later. Concretely, we test whether there exists a
matching in the bipartite graph B(I − C, C − I; E_{G_i}(I − C; C − I)) for each color
class C such that all vertices in C − I are matched. If such matchings can be found
for all the colors, we say that the M-TEST is passed. Note that all these matchings
together give a spanning star forest such that each star is polychromatic. However, this
does not guarantee that the cardinality constraint is preserved. The crucial fact here
is that I passes both F-TEST and M-TEST. In Lemma 2 we formally prove that there
exists a valid spanning star forest with nodes in I as star centers if and only if G_i and
I pass both F-TEST and M-TEST. To actually find a valid spanning star forest, we can
again use the network flow construction in F-TEST but without the o'_j nodes. If M-TEST
fails, we know that for some color class C, there exists a subset S ⊆ C − I such that
the size of its neighbor set |N_B(S)| is less than |S| by Hall's theorem [19]. In this
case, M-TEST returns (S, N_B(S)); such a set S can be found by a maximum matching
algorithm. Then we update the independent set I ← I − N_B(S) + S; we show that
I is still an independent set in Lemma 1. Finally, we add nodes to I arbitrarily until it
becomes a maximal independent set. Then, we start the next iteration with the new I.
Since |S| > |N_B(S)|, we increase |I| by at least one in each iteration. So the algorithm
terminates in at most n iterations.
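
One way such a set S could be extracted from a maximum matching computation is sketched below in Python. This is an illustrative sketch, not the paper's implementation: it uses the simple augmenting-path method for one color class, with the left side playing the role of C − I and the right side the role of I − C. The function name hall_violator and the adjacency-dictionary input are assumptions made here.

```python
def hall_violator(left, adj):
    """Return (S, N(S)) with |N(S)| < |S|, or (set(), set()) if every
    left vertex can be matched. `adj[u]` lists the right-side
    neighbours of the left vertex `u` (hypothetical input format)."""
    match_of_right = {}  # right vertex -> left vertex currently matched to it

    def try_augment(u, visited):
        # Depth-first search for an augmenting path starting at u.
        for v in adj.get(u, ()):
            if v in visited:
                continue
            visited.add(v)
            if v not in match_of_right or try_augment(match_of_right[v], visited):
                match_of_right[v] = u
                return True
        return False

    for u in left:
        visited = set()
        if not try_augment(u, visited):
            # The failed search reached exactly the right vertices in
            # `visited`, all of them matched. Their partners together
            # with u form S, and N(S) = visited has one element fewer.
            S = {u} | {match_of_right[v] for v in visited}
            return S, visited
    return set(), set()
```

An empty first component corresponds to the case in which the matching test is passed for this color class.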

Before proving that Algorithm 1 is guaranteed to succeed when w(e_i) = d*, we
prove the two lemmas left in the description of our algorithm. The first lemma ensures
that I is always an independent set.

Lemma 1 The new set I ← I − S' + S obtained in each update is still an independent
set in G_i.

Proof: Since all vertices in S have the same color, there is no edge among them.
Therefore, we only need to prove that there is no edge between I − S' and S, which is
trivial since S' = N_B(S). □

The second lemma guarantees that we find a feasible solution if both tests are passed.

Lemma 2 Given G_i = G(V, E_i) and I, a maximal independent set of G_i, both
F-TEST(G_i, I) and M-TEST(G_i, I) are passed if and only if there exists a valid spanning
star forest in G_i with nodes in I being star centers.

Proof: The "if" part is trivial. We only prove the "only if" part. Suppose G_i and I pass
both F-TEST(G_i, I) and M-TEST(G_i, I). Consider a semi-valid spanning star forest
obtained in G_i after F-TEST. We delete a minimal set of leaves to make it a valid star (not
necessarily spanning) forest F. Consider the bipartite graph B(I − C, C − I; E_{G_i}(I −
C; C − I)) for each color class C. We can see F ∩ B is a matching in B. Since I passes
M-TEST, we know that there exists a maximum matching such that all nodes in C − I
can be matched. If we use the Hungarian algorithm to compute a maximum matching
with F ∩ B as the initial matching, the nodes in I − C which are originally matched
will still be matched in the maximum matching due to the property of alternating
path augmentation¹. Therefore, the following invariants are maintained: each star is
polychromatic and has at least ℓ colors. By applying the above maximum matching
computation for each color class, we obtain a valid spanning star forest in G_i. □

Finally, we prove that Algorithm 1 is guaranteed to succeed on G_{i*} for the maximal
index i* such that w(e_{i*}) = d*, where d* is the optimal cluster diameter of any valid
spanning star forest of G.

Lemma 3 Algorithm 1 will succeed on G_{i*}.

Proof: Suppose C* = {C*_1, ..., C*_{k*}} is the set of clusters in the optimal clustering
with cluster diameter d*. Since G_{i*} includes all edges of weight no more than d*, each
C*_j induces a clique in G_{i*} for all 1 ≤ j ≤ k*, thus it contains at most one node in
any independent set. Therefore, any maximal independent set I in G_{i*} can pass
F-TEST(G_{i*}, I), and we only need to argue that I will also pass M-TEST(G_{i*}, I). Each
update to the independent set I increases the size of I by at least 1 and the maximum
size of I is k*. When |I| = k*, each C*_j contains exactly one node in I and this I must
be able to pass M-TEST(G_{i*}, I). So Algorithm 1 must succeed in some iteration. □

By Lemma 3, the cost of the spanning star forest found by Algorithm 1 is at
most d*. Since the cost of the optimal spanning forest is at least d*/2, we obtain a
2-approximation.

Theorem 1 There is a polynomial-time 2-approximation for ℓ-DIVERSITY.

3 The Lower Bound 

In this section, we show that ℓ-DIVERSITY is NP-hard to approximate within a factor
less than 2 even when there are only three colors. Therefore the approximation ratio
given in Section 2 for ℓ-DIVERSITY is tight. Note that if there are two colors, the
problem is polynomially solvable. Indeed, if there are only two colors we can use the
following simple algorithm to obtain an optimal solution. We start with an empty
graph and add edges one by one in an increasing order of their weights (as before, we
just add those edges whose endpoints' colors are different), getting a series of threshold
graphs G_i (1 ≤ i ≤ m). For each graph G_i, we try to find a perfect matching between
the two color classes. It is easy to see that w(e_i) is the optimal value, where i is the
smallest index such that a perfect matching in G_i exists.
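
The two-color procedure can be sketched in Python as follows. This is a minimal illustration rather than an optimized implementation: it assumes ℓ = 2, points given as coordinate tuples with the Euclidean metric, and a plain augmenting-path matching routine rerun for each threshold. The function name two_color_optimum and its argument names are hypothetical.

```python
from math import dist  # Euclidean distance, standing in for the metric


def two_color_optimum(red, blue):
    """Smallest threshold w such that the two color classes can be
    perfectly matched using only pairs at distance <= w, or None if no
    perfect matching exists (for example, when the class sizes differ)."""
    if len(red) != len(blue):
        return None
    weights = sorted({dist(r, b) for r in red for b in blue})

    def has_perfect_matching(limit):
        match_of_blue = {}  # blue index -> red index matched to it

        def augment(i, seen):
            for j, b in enumerate(blue):
                if j in seen or dist(red[i], b) > limit:
                    continue
                seen.add(j)
                if j not in match_of_blue or augment(match_of_blue[j], seen):
                    match_of_blue[j] = i
                    return True
            return False

        return all(augment(i, set()) for i in range(len(red)))

    # Try thresholds in increasing order, mirroring edge-by-edge insertion.
    for w in weights:
        if has_perfect_matching(w):
            return w
    return None
```

Since feasibility is monotone in the threshold, the linear scan could be replaced by a binary search over the sorted weights.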

Theorem 2 There is no polynomial-time approximation algorithm for ℓ-DIVERSITY that
achieves an approximation factor less than 2 unless P = NP.

To prove the NP-hardness for three colors, we first show that the following problem
is NP-hard: Given a 3-colorable graph G = (V, E) and a feasible 3-coloring, decide
if V can be partitioned into subsets of size three such that the three vertices in each
subset are connected and the colors of them are all different. We call such a partition
a P(ath)_3-partition of G. Note that the NP-hardness of the P_3-partition problem
directly leads to the fact that ℓ-DIVERSITY cannot be approximated within a factor less
than 2, since if we assign the weights of all edges in G to be 1 and consider the metric
completion² of G, then the optimal solution of ℓ-DIVERSITY on G is 1 if G admits a
P_3-partition and at least 2 otherwise.

1 Recall that an augmenting path P (with respect to matching M) is a path starting from an unmatched node,
alternating between unmatched and matched edges and ending also at an unmatched node (for example, see
[23]). By taking the symmetric difference of P and M, which we call augmenting on P, we can obtain a
new matching with one more edge.

Figure 3: (a) The gadget; (b) two possible partitions of the gadget. Nodes in thick circles
are corner nodes.

We reduce from the well-known 3-dimensional matching problem [13] to the
P_3-partition problem. Recall that in a 3-dimensional matching instance, we are given
a tripartite hyper-graph G = (X ∪ Y ∪ Z, E) with color classes X, Y, Z such that
|X| = |Y| = |Z|. Each hyper-edge is of the form (x, y, z), x ∈ X, y ∈ Y, z ∈ Z.
A perfect matching is a set M ⊆ E of hyper-edges such that each vertex is incident
to exactly one edge in M. Given a 3-dimensional matching instance G(V, E), we
construct a 3-colorable graph G' as well as a feasible 3-coloring such that G has a
perfect matching if and only if G' can be P_3-partitioned.

The key component of our reduction is the gadget depicted in Figure 3. G' has a
copy of the vertices of G. We color the vertices in X, Y and Z with colors 1, 2 and 3,
respectively. For each hyperedge e = (x, y, z) ∈ E(G), we attach a distinct gadget to
G' by identifying x, y, z with the three corner nodes of the gadget, respectively (corner
nodes are those in thick circles in Figure 3(a)). The gadget has the following nice
property.

Property 1 If G' can be P_3-partitioned, then any P_3-partition of G' restricted to one
gadget can only take one of the two forms shown in Figure 3(b).

The proof of the property is graphically obvious owing to the structure of the
gadget. One can easily make it rigorous by a case-by-case analysis. To relate the partition
of each gadget to the 3-dimensional matching problem, we use the following reduction.
We take e = (x, y, z) as a matching edge if and only if the P_3-partition of G' restricted to
the corresponding gadget takes the first form in Figure 3(b). It is easy to see that G has
a perfect matching if and only if G' can be P_3-partitioned.



2 The metric completion of G(V, E) is a complete graph with vertex set V and the weight of edge (u, v)
defined by the shortest path distance between u and v in G for every u, v ∈ V.



4 Dealing with Unqualified Inputs 



For the ℓ-DIVERSITY problem, a feasible solution may not exist depending on the
input color distribution. The following simple lemma gives a necessary and sufficient
condition for the existence of a feasible solution.

Lemma 4 There exists a feasible solution for ℓ-DIVERSITY if and only if the number of
nodes with the same color c is at most ⌊n/ℓ⌋ for each color c.

Proof: The "only if" part is trivial. We only show the "if" part. Suppose the color
class C contains the largest number of nodes. We create |C| empty clusters, and process
color classes one after another. For each node v, we put it into the cluster currently
containing the least number of nodes provided that the cluster does not contain a node
having the same color as v. Note that during the process, it can be easily shown by
induction that the sizes of clusters differ by at most one. Therefore, each cluster contains at
least ⌊n/|C|⌋ ≥ ℓ nodes at the end of the process. □
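
The feasibility condition and a constructive partition can be sketched in Python as follows. Note that this sketch uses a round-robin variant of the argument rather than the least-loaded rule in the proof: points are grouped by color and dealt out cyclically over as many clusters as the largest color class has points, which also keeps cluster sizes within one of each other. The function name diverse_partition is hypothetical, the metric is ignored (only feasibility is addressed), and colors are assumed to be sortable values such as strings.

```python
from collections import Counter


def diverse_partition(colors, ell):
    """Return clusters (lists of point indices) in which every cluster has
    at least ell points with pairwise distinct colors, or None if the
    condition of the lemma fails."""
    n = len(colors)
    if n == 0:
        return []
    largest = max(Counter(colors).values())
    if largest > n // ell:
        return None  # some color has more than floor(n / ell) points
    # Points of one color are contiguous in `order` and number at most
    # `largest`, so dealing cyclically never repeats a color in a cluster.
    order = sorted(range(n), key=lambda i: colors[i])
    clusters = [[] for _ in range(largest)]
    for position, i in enumerate(order):
        clusters[position % largest].append(i)
    return clusters
```

Each cluster receives at least ⌊n/largest⌋ points, which is at least ℓ whenever the feasibility check passes.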

To cluster an instance without a feasible solution, we must exclude some nodes as 
outliers. The following lemma characterizes the minimum number of outliers. 

Lemma 5 Let $C_1, C_2, \ldots, C_k$ be the color classes sorted in the non-increasing order
of their sizes.

1. Let $p$ be the maximum integer satisfying $\sum_{i=1}^{k} \min(p, |C_i|) \ge p\ell$. The minimum
number of outliers is given by $q = \sum_{i=1}^{k} \max(0, |C_i| - p)$ and $p$ is the number
of clusters when we exclude $q$ outliers.

2. $p\ell \le n - q < (p+1)\ell$.

Proof: Let $p^*$ be the maximum number of clusters in any feasible solution. It is easy to
see that we must exclude at least $\sum_{i=1}^{k} \max(0, |C_i| - p^*)$ outliers. Thus the number
of points left is at most

$$\sum_{i=1}^{k} \bigl(|C_i| - \max(0, |C_i| - p^*)\bigr) = \sum_{i=1}^{k} \min(p^*, |C_i|),$$

which should be at least $p^*\ell$. Since we pick $p$ to be the maximum integer such that
$\sum_{i=1}^{k} \min(p, |C_i|) \ge p\ell$, we have $p \ge p^*$. Thus $q$ is a lower bound on the number
of outliers. Moreover, if we delete $\max(0, |C_i| - p)$ nodes from $C_i$, the remainders
admit a feasible clustering by Lemma 4. Therefore, $q$ is also an upper bound, and the
proof of part 1 is completed.

$n - q \ge p\ell$ is obvious. That $n - q < (p+1)\ell$ can be seen as follows. Suppose it is
not true. We have

$$n - \sum_{i=1}^{k} \max(0, |C_i| - p) = \sum_{i=1}^{k} \bigl(|C_i| - \max(0, |C_i| - p)\bigr) = \sum_{i=1}^{k} \min(p, |C_i|) \ge (p+1)\ell.$$

Thus, $\sum_{i=1}^{k} \min(p+1, |C_i|) \ge (p+1)\ell$, which contradicts the maximality of $p$. □
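
Since p and q depend only on the sizes of the color classes, they can be computed directly from the definitions in part 1 of Lemma 5. The Python sketch below does this by a simple downward scan over candidate values of p. It is an illustration only: the function name clusters_and_outliers is hypothetical and no attempt is made at efficiency.

```python
from collections import Counter


def clusters_and_outliers(colors, ell):
    """Return (p, q): the number of clusters p and the minimum number of
    outliers q, following the definitions in part 1 of the lemma."""
    sizes = list(Counter(colors).values())
    n = len(colors)
    p = 0
    # No p larger than n // ell can satisfy the inequality, so scan downward
    # and stop at the first (hence largest) candidate that satisfies it.
    for candidate in range(n // ell, 0, -1):
        if sum(min(candidate, s) for s in sizes) >= candidate * ell:
            p = candidate
            break
    q = sum(max(0, s - p) for s in sizes)
    return p, q
```

For example, with color classes of sizes 5, 2 and 1 and ℓ = 2, the scan rejects p = 4 (only 7 points remain, fewer than 8) and accepts p = 3, so two points of the largest class become outliers.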



With Lemma 5 at hand, it is natural to consider the following optimization problem:
find an ℓ-DIVERSITY solution by clustering n − q points such that the maximum cluster
radius is minimized. We call this problem ℓ-DIVERSITY-OUTLIERS. From Lemma 5
we can see that p is independent of the metric and can be computed in advance. In
addition, implicit from Lemma 5 is that the number of outliers of each color is also
fixed, but we need to decide which points should be chosen as outliers.

In the fortunate case where we have a color class C with exactly p nodes, we know 
that there is exactly one node of C in each cluster of any feasible solution. By using a 
similar flow network construction used in F-Test, we can easily get a 2-approximation 
using C as the cluster centers. However, the problem becomes much more difficult 
when the sizes of all color classes are different from p. Loosely speaking, our difficulty 
is two-fold: exponentially many choices of outliers and cluster centers. 

4.1 A Constant Approximation 

We first define some notation. We call color classes of size larger than p popular
colors and nodes having such colors popular nodes. Other color classes have at most p
nodes, and these nodes are unpopular. We denote the set of popular nodes by 𝒫 and the
set of unpopular nodes by 𝒩. Note that after removing the outliers, each popular color
has exactly p nodes and each cluster will contain the same number of popular nodes.
Let z be the number of popular nodes each cluster contains. We denote by G^d the
power graph of G in which two vertices u, v are adjacent if there is a path connecting u
and v with at most d edges. The length of the edge (u, v) in G^d is set to be dist_G(u, v).
Before describing the algorithm, we need the following simple lemma.

Lemma 6 For any connected graph G, G^3 contains a Hamiltonian cycle which can be
found in linear time.

Proof: Let $T$ be any spanning tree of $G$. It suffices to prove $T^3$ contains a Hamiltonian
cycle. We root $T$ at an arbitrary vertex $r$ and denote the depth of a vertex $v$ by depth($v$)
(depth($r$) = 1). Let $T_v$ be the subtree rooted at $v$ and Ch($v$) be the set of children of
$v$. Consider Algorithm 2, Traverse($r$). Clearly, the algorithm runs in linear time.

Suppose the order obtained is $\{v_1, v_2, \ldots, v_n\}$. We claim $\mathrm{dist}_T(v_i, v_{i+1}) \le 3$ for
$1 \le i \le n-1$ and $\mathrm{dist}_T(v_1, v_n) \le 1$. Note that the claim immediately implies the
lemma. We prove the claim by induction on the size of the tree. Suppose depth($v$) is
odd (the other case can be proved similarly) and Ch($v$) $= \{u_1, \ldots, u_k\}$. Let $O_i =
\{v_{i,1}, v_{i,2}, \ldots, v_{i,|O_i|}\}$ be the traverse order of $T_{u_i}$. It is easy to see the order produced
by Traverse($v$) is $v, O_1, O_2, \ldots, O_k$. It is also not hard to see that

$$\mathrm{dist}_T(v, v_{1,1}) \le \mathrm{dist}_T(v, u_1) + \mathrm{dist}_T(u_1, v_{1,1}) = \mathrm{dist}_T(v, u_1) + \mathrm{dist}_T(v_{1,|O_1|}, v_{1,1}) \le 2$$

and $\mathrm{dist}_T(v_{k,|O_k|}, v) = \mathrm{dist}_T(u_k, v) = 1$
since $u_i$ is the last vertex in order $O_i$. We can also see that

$$\mathrm{dist}_T(v_{i,|O_i|}, v_{i+1,1}) \le \mathrm{dist}_T(u_i, v) + \mathrm{dist}_T(v, u_{i+1}) + \mathrm{dist}_T(u_{i+1}, v_{i+1,1}) \le 3.$$

By the induction hypothesis, the proof is completed. □



Algorithm 2: Traverse(v) 

if depth(v) is odd then 
    visit(v); 
    for each u in Ch(v): Traverse(u); 
else 
    for each u in Ch(v): Traverse(u); 
    visit(v); 
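
The traversal can be tried out on a small tree. The following is a minimal Python sketch, not the authors' implementation: the function names `traverse_order` and `tree_distance`, the dictionary-based tree representation and the example tree are all assumptions made for illustration. It computes the visiting order of Traverse and measures the tree distance between cyclically consecutive vertices, which the claim in the proof of Lemma 6 bounds by 3 (and by 1 for the closing step).

```python
def traverse_order(children, root):
    """Visiting order of Traverse(root) on a rooted tree.

    children: dict mapping a vertex to the list of its children.
    Vertices at odd depth (the root has depth 1) are visited before
    their subtrees; vertices at even depth are visited after them.
    """
    order = []
    # Explicit stack instead of recursion: (vertex, depth, emit_now).
    stack = [(root, 1, False)]
    while stack:
        v, depth, emit_now = stack.pop()
        if emit_now:
            order.append(v)
            continue
        kids = [(u, depth + 1, False) for u in children.get(v, [])]
        if depth % 2 == 1:
            order.append(v)               # odd depth: visit v first
            stack.extend(reversed(kids))  # then the children, left to right
        else:
            stack.append((v, depth, True))  # even depth: visit v last
            stack.extend(reversed(kids))
    return order


def tree_distance(parent, depth, a, b):
    """Number of edges on the tree path between a and b."""
    dist = 0
    while a != b:
        if depth[a] < depth[b]:
            a, b = b, a
        a = parent[a]  # move the deeper endpoint one step up
        dist += 1
    return dist


# Hypothetical example tree rooted at vertex 1.
children = {1: [2, 3], 2: [4, 5], 3: [6], 4: [7], 6: [8, 9]}
parent, depth = {}, {1: 1}
pending = [1]
while pending:
    v = pending.pop()
    for u in children.get(v, []):
        parent[u], depth[u] = v, depth[v] + 1
        pending.append(u)

order = traverse_order(children, 1)
# Distances between cyclically consecutive vertices of the order.
gaps = [tree_distance(parent, depth, order[i], order[(i + 1) % len(order)])
        for i in range(len(order))]
```

On this tree the order is 1, 4, 7, 5, 2, 6, 8, 9, 3; no gap exceeds 3 and the last vertex is a child of the root, so the order closes into a cycle of $T^3$.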



The algorithm still adopts the thresholding method, that is, we add edges one by 
one to get graphs $G_i = (V, E_i = \{e_1, e_2, \ldots, e_i\})$, for $i = 1, 2, \ldots$, and in each $G_i$, 
we try to find a valid star forest that spans $G_i$ except q outliers. Let $d^*$ be the diameter 
of the optimal solution that clusters n - q points, and $i^*$ be the maximum index such 
that $w(e_{i^*}) = d^*$. Let $G_i[\mathcal{N}]$ be the subgraph of $G_i$ induced by all unpopular nodes. We 
define the ball of radius r around v to be $B(v, r) = \{u \in \mathcal{N} \mid \mathrm{dist}_G(v, u) \le r\}$. For 
each $G_i$, we run the algorithm $\ell$-Diversity-Outliers($G_i$) (see below). We proceed 
to $G_{i+1}$ when the algorithm claims failure. 
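
The outer thresholding loop can be sketched as follows. This is only an illustration under stated assumptions: `threshold_search` is a hypothetical name, edges are given as (weight, u, v) triples, and `try_cluster` stands in for the per-graph test (here a toy callback, not the actual $\ell$-Diversity-Outliers procedure) that returns a solution or `None` on failure.

```python
def threshold_search(edges, try_cluster):
    """Add edges in nondecreasing weight order and stop at the first
    graph G_i on which the per-graph test succeeds.

    edges: list of (weight, u, v) triples.
    try_cluster: callback receiving the current edge list; returns a
                 solution, or None to signal failure.
    """
    current = []  # edge set E_i of the current graph G_i
    for edge in sorted(edges, key=lambda e: e[0]):
        current.append(edge)
        solution = try_cluster(current)
        if solution is not None:
            return edge[0], solution  # threshold w(e_i) and the solution
    return None


# Toy stand-in for the real test: succeed once two edges are present.
edges = [(3.0, "a", "c"), (1.0, "a", "b"), (2.0, "b", "c")]
toy_test = lambda graph: list(graph) if len(graph) >= 2 else None
result = threshold_search(edges, toy_test)
```

Because the test is run on every prefix of the sorted edge list, the first success happens at the smallest threshold for which the test can succeed.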

The high-level idea of the algorithm is as follows. Our goal is to show that the 
algorithm can find a valid star forest spanning n - q nodes in $G_{i^*}^{28}$. It is not hard to see 
that this gives us an approximation algorithm with factor $28 \times 2 = 56$. First, we notice 
that F-Test can be easily modified to work for the outlier version by excluding all $o'_j$ 
nodes and testing whether there is a flow of value n - q. However, the network flow 
construction needs to know in advance the set of candidate cluster centers. For this 
purpose, we attach a set U of p new nodes, which we call virtual centers, to 
$G_i$; these serve as the candidate cluster centers in F-Test. In the ideal case, if these 
virtual centers can be distributed such that each of them is attached to a distinct optimal 
cluster, F-Test can easily produce a 2-approximation. Since the optimal clustering is 
not known, this is very difficult in general. However, we show that there is a way to 
carefully distribute the virtual centers such that there is a perfect matching between 
these virtual centers and the optimal cluster centers and the longest matching edge is 
at most $27d^*$. This implies that there is a valid spanning star forest in $G_{i^*}^{28}$ (each star 
is formed by an optimal cluster together with the virtual center that matches the cluster 
center). Indeed, each virtual center is at most $27d^* + d^* = 28d^*$ away from 
any other node in the same star. Therefore, it suffices to just run F-Test($G_{i^*}^{28}$, U) to 
find a feasible solution. 

Algorithm: $\ell$-Diversity-Outliers($G_i$). 

1. If $G_i[\mathcal{N}]$ contains a connected component with fewer than $\ell - z$ nodes, we declare 
failure. 

2. Pick an arbitrary unpopular node v such that $|B(v, w(e_i))| \ge \ell - z$ and delete 
all vertices in this ball; repeat until no such node exists. Then, pick an arbitrary 
unpopular node v and delete all vertices in $B(v, w(e_i))$; repeat until no unpopular 
node is left. Let $B_1, B_2, \ldots$ be the balls created during the process. If a ball 
contains at least $\ell - z$ unpopular nodes, we call it big. Otherwise, we call it small. 

3. In $G_i[\mathcal{N}]$, shrink each $B_j$ into a single node $b_j$. A node $b_j$ is big if $B_j$ is big and 
small otherwise. We define the weight of $b_j$ to be $\mu(b_j) = |B_j| / (\ell - z)$. Let the resulting 
graph with vertex set $\{b_j\}_{j \ge 1}$ be $D_i$. 

4. For each connected component C of $D_i$, do 

(a) Find a spanning tree $T_C$ of $C^3$ such that all small nodes are leaves. If this 
is not possible, we declare failure. 

(b) Find (by Lemma 6) a Hamiltonian cycle $P = \{b_1, b_2, \ldots, b_h, b_{h+1} = b_1\}$ 
over all non-leaf nodes of C such that $\mathrm{dist}_{D_i}(b_j, b_{j+1}) \le 9w(e_i)$. 

5. We create a new color class U of p nodes which will serve as "virtual centers" 
of the p clusters. These virtual centers are placed in $G_i$ "evenly" as follows. 
Consider each connected component C in $D_i$ and the corresponding spanning 
tree $T_C$ of $C^3$. For each non-leaf node $b_j$ in $T_C$, let $L(b_j)$ be the set of leaves 
connected to $b_j$ in $T_C$, and let $\eta(b_j) = \mu(b_j) + \sum_{b_x \in L(b_j)} \mu(b_x)$ and 
$\delta_j = \sum_{x=1}^{j} \eta(b_x)$ (with $\delta_0 = 0$). We attach $\lfloor \delta_j \rfloor - \lfloor \delta_{j-1} \rfloor$ virtual centers to the center of $B_j$ by 
zero-weight edges. If the total number of virtual centers used is not equal to p, 
we declare failure. Let $H_i$ be the resulting graph (including all popular nodes, 
unpopular nodes and virtual centers). 

6. Find a valid star forest in $H_i^{28}$ using U as centers, which spans n - q nodes (not 
including the nodes in U) by using F-Test. If it succeeds, we return the star forest 
found; otherwise we declare failure. 
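
Steps 2 and 5 above can be illustrated with a short Python sketch. It is a simplified illustration rather than the authors' code: the names `grow_balls` and `virtual_center_counts` are hypothetical, "arbitrary" choices are resolved by list order, balls are taken among the nodes not yet deleted (an assumption about how the deletion interacts with the ball definition), and the example points on a line are made up. Exact rationals are used so that the floors of the partial sums $\delta_j$ are computed without rounding error.

```python
from fractions import Fraction
from math import floor


def grow_balls(nodes, dist, w, need):
    """Step 2: carve balls of radius w out of the unpopular nodes.

    need is l - z. Returns (center, members, is_big) triples; big balls
    are carved first, the leftover nodes are covered by small balls.
    """
    remaining = list(nodes)
    balls = []

    def ball(center):
        return [u for u in remaining if dist(center, u) <= w]

    while True:  # phase 1: big balls
        center = next((v for v in remaining if len(ball(v)) >= need), None)
        if center is None:
            break
        members = ball(center)
        balls.append((center, members, True))
        remaining = [u for u in remaining if u not in members]
    while remaining:  # phase 2: small balls
        center = remaining[0]
        members = ball(center)
        balls.append((center, members, False))
        remaining = [u for u in remaining if u not in members]
    return balls


def virtual_center_counts(eta):
    """Step 5: eta[j] is eta(b_j) for the j-th big node on the cycle.

    Returns floor(delta_j) - floor(delta_{j-1}) for each j, where
    delta_j is the j-th partial sum of eta and delta_0 = 0.
    """
    counts, delta = [], Fraction(0)
    for e in eta:
        previous = floor(delta)
        delta += e
        counts.append(floor(delta) - previous)
    return counts


# Hypothetical instance: points on a line, radius 2, l - z = 3.
nodes = [0, 1, 2, 4, 5, 6, 7]
balls = grow_balls(nodes, lambda a, b: abs(a - b), w=2, need=3)
# Two big balls of weight 1; the small ball {7} (weight 1/3) hangs off the second.
counts = virtual_center_counts([Fraction(1), Fraction(4, 3)])
```

In this instance the two big balls receive one virtual center each, and the total of two equals $\lfloor 7/3 \rfloor$, the floor of the total weight of the component.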

4.2 Analysis of the algorithm 

We show that the algorithm succeeds on $G_{i^*}$. Since we perform F-Test on $H_{i^*}^{28}$, in 
which each edge is of length at most $28d^*$, the radius of each cluster is at most $28d^* \le 56r^*$. 
Therefore, the approximation ratio is 56. 

Let $H_{i^*}$ be the graph obtained by adding virtual centers to $G_{i^*}$ as described above. 
Let $\mathcal{C}^* = \{C_1^*, \ldots, C_p^*\}$ be the optimal clustering. Let $I^* = \{v_1^*, \ldots, v_p^*\}$ be the set of 
cluster centers of $\mathcal{C}^*$ where $v_i^*$ is the center of $C_i^*$. We denote the balls grown in step 2 
by $B_1, \ldots, B_k$. Let $v_i$ be the center of $B_i$. 

The algorithm may possibly fail in step 1, step 4(a), step 5 and step 6. Obviously 
$G_{i^*}$ can pass step 1. Therefore, we only check the other three cases. 

Step 4(a): We prove that the subgraph induced by all big nodes is connected in $C^3$. 
Indeed, we claim that each small node is adjacent to at least one big node in C, from 
which the proof follows easily. Now we prove the claim. Suppose $b_j$ is a small node 
and all its neighbors are small. We know that in $G_{i^*}[\mathcal{N}]$, $v_j$ has at least $\ell - z - 1$ 
neighbors because $v_j$ is an unpopular node and thus belongs to some optimal cluster. 
So we could form a big ball around $v_j$, contradicting the fact that $v_j$ is in a 
small ball. To find a spanning tree with all small nodes as leaves, we first assign each 
small node to one of its adjacent big nodes arbitrarily and then compute a tree spanning all 
the big nodes. 



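Step 5: We can see that in each connected component C (with big nodes $b_1, \ldots, b_h$) 
in $D_{i^*}$, the total number of virtual centers we have placed is 

\[ \sum_{i=1}^{h} \left( \lfloor \delta_i \rfloor - \lfloor \delta_{i-1} \rfloor \right) = \lfloor \delta_h \rfloor = \Big\lfloor \sum_{i=1}^{h} \eta(b_i) \Big\rfloor = \Big\lfloor \sum_{b_j \in C} \mu(b_j) \Big\rfloor = \left\lfloor \frac{|C|}{\ell - z} \right\rfloor, \]

where $|C| = \sum_{b_j \in C} |B_j|$ is the number of nodes in the connected component of $G_{i^*}[\mathcal{N}]$ 
corresponding to C. This is at least the number of clusters created for the component C 
in the optimal solution. Therefore, we can see the total number of virtual centers created 
is at least p. On the other hand, from Lemma 5(2), we can see that 
$p(\ell - z) \le |\mathcal{N}| < (p + 1)(\ell - z)$. Hence, 

\[ p = \left\lfloor \frac{|\mathcal{N}|}{\ell - z} \right\rfloor = \left\lfloor \frac{\sum_C |C|}{\ell - z} \right\rfloor \ge \sum_C \left\lfloor \frac{|C|}{\ell - z} \right\rfloor, \]

where the summation is over all connected components. So, we have proved that exactly 
p virtual centers are placed in $G_{i^*}$. 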
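
As a sanity check of this counting argument, consider a small numerical instance. The numbers below are hypothetical, chosen only for illustration, with $\ell - z = 3$ and two connected components $C_1$, $C_2$.

```latex
% Hypothetical numbers, chosen only to illustrate the counting in Step 5.
\[
|C_1| = 6,\ |C_2| = 10
\;\Longrightarrow\;
|\mathcal{N}| = 16,\quad
p = \left\lfloor \tfrac{16}{3} \right\rfloor = 5,\quad
\left\lfloor \tfrac{6}{3} \right\rfloor + \left\lfloor \tfrac{10}{3} \right\rfloor
= 2 + 3 = 5 = p .
\]
% By contrast, component sizes 7 and 8 give |N| = 15 and again p = 5, but
\[
\left\lfloor \tfrac{7}{3} \right\rfloor + \left\lfloor \tfrac{8}{3} \right\rfloor
= 2 + 2 = 4 < p .
\]
```

In the second case fewer than p virtual centers would be placed, so Step 5 would declare failure. By the argument above this cannot happen on $G_{i^*}$, where every component receives at least as many virtual centers as it has optimal clusters.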

Step 6: According to the high-level idea discussed before, we only need to show that 
there is a perfect matching M between U and the set of optimal centers $I^*$ in $H_{i^*}^{27}$. 
We consider the bipartite subgraph $Q(U, I^*, E_{H_{i^*}^{27}}(U, I^*))$. From Hall's theorem, it 
suffices to show that $|N_Q(S)| \ge |S|$ for any $S \subseteq U$, which is implied by the 
following lemma. 

Lemma 7 For any $S \subseteq U$, the union of the balls of radius $27d^*$ around the nodes of 
S, i.e., $\bigcup_{u \in S} B(u, 27d^*)$, intersects at least $|S|$ optimal clusters in $\mathcal{C}^*$. 

Proof: We can assume w.l.o.g. that all nodes of S are in a single connected component 
of $G_{i^*}[\mathcal{N}]$. The generalization to several connected components is straightforward. 
Let $P = \{b_1, b_2, \ldots, b_h\}$ be the Hamiltonian cycle (found in Step 4(b)) for such a 
component (actually, the component after shrinking balls). Let $P'$ be the set of nodes 
$b_j \in P$ such that at least one virtual center in S is attached to $B_j$. 

We first assume $|P'| \le h - 2$. In this case, we claim that $|\bigcup_{u \in S} B(u, 27d^*)| \ge 
(|S| + 1)(\ell - z)$. We know by definition that the number of nodes in each big ball $B_j$ plus 
nodes in those small balls $B_x$ attached to it (that is, $b_x \in L(b_j)$) is $\eta(b_j)(\ell - z)$. $P'$ can 
be seen as a collection of subpaths of P. For each of those subpaths, say $\{b_j, \ldots, b_{j'}\}$, 
the number of nodes in S attached to it is at most 

\[ \sum_{x=j}^{j'} \left( \lfloor \delta_x \rfloor - \lfloor \delta_{x-1} \rfloor \right) = \lfloor \delta_{j'} \rfloor - \lfloor \delta_{j-1} \rfloor. \]

On the other hand, we can see that $B(v_j, 27d^*)$ contains all nodes in $B_{j-1}$, $B_j$ and $B_{j+1}$ 
and all nodes in the small balls attached to $B_j$. This is because $\mathrm{dist}_{D_{i^*}}(b_j, b_{j+1}) \le 
9d^*$ and each $b_j$ is obtained by shrinking a ball of radius at most $d^*$. Therefore, 

\[ \Big| \bigcup_{x=j}^{j'} B(v_x, 27d^*) \Big| \ge \sum_{x=j-1}^{j'+1} \eta(b_x)(\ell - z) \ge \Big( \sum_{x=j}^{j'} \eta(b_x) + 2 \Big)(\ell - z) \ge \left( \lfloor \delta_{j'} \rfloor - \lfloor \delta_{j-1} \rfloor + 1 \right)(\ell - z), \]

where indices are taken cyclically ($B_0 = B_h$, $B_{h+1} = B_1$) and the second inequality 
holds since $\eta(b_j) \ge 1$ for any big node $b_j$. Summing 
up over all the subpaths of $P'$ proves the claim. From Lemma 5(2) and the fact that each 
cluster has exactly z popular nodes, we can see that $|S|$ optimal clusters contain fewer than 
$(\ell - z)(|S| + 1)$ unpopular nodes. Therefore, the lemma holds. 

If $|P'| > h - 2$, then $\bigcup_{u \in S} B(u, 27d^*)$ contains all unpopular nodes in this component. 
The lemma also follows. □ 
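
The link used in Step 6 between Hall's condition and a perfect matching can be checked on tiny instances. The sketch below is an illustration only: it uses the standard augmenting-path (Kuhn) matching algorithm rather than anything specific to the paper, the names `max_matching` and `hall_condition` are hypothetical, and the two small neighborhood maps are made up. Left vertices play the role of virtual centers and right vertices the role of optimal centers within the distance bound.

```python
from itertools import combinations


def max_matching(neighbors):
    """Maximum bipartite matching via augmenting paths.

    neighbors: dict mapping each left vertex to a list of right vertices.
    Returns a dict right -> left describing the matching.
    """
    match = {}

    def augment(u, seen):
        for v in neighbors[u]:
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its current partner can be re-matched elsewhere
            if v not in match or augment(match[v], seen):
                match[v] = u
                return True
        return False

    for u in neighbors:
        augment(u, set())
    return match


def hall_condition(neighbors):
    """Brute-force check of |N(S)| >= |S| for every nonempty subset S
    of the left side (exponential time; tiny inputs only)."""
    left = list(neighbors)
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            reach = set().union(*(neighbors[u] for u in subset))
            if len(reach) < size:
                return False
    return True


# Hypothetical neighborhoods: the first satisfies Hall's condition, the second does not.
good = {"u1": ["c1", "c2"], "u2": ["c1"], "u3": ["c2", "c3"]}
bad = {"u1": ["c1"], "u2": ["c1"]}
good_size = len(max_matching(good))
bad_size = len(max_matching(bad))
```

In the first instance every left vertex is matched, in agreement with Hall's theorem; in the second, the set {u1, u2} has a single neighbor and only one of the two can be matched.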

Theorem 3 There is a polynomial-time algorithm for $\ell$-Diversity-Outliers that 
produces a 56-approximation. 



5 Further Directions 

This work results in several open questions. First, as in |fl~), we could also try to mini- 
mize the sum of the radii of the clusters. However, this seems to be much more difficult, 
and we leave it as an interesting open problem. Another open problem is to design con- 
stant approximations for the problem with any fixed number of outliers, that is, for a 
given number k, find an optimal clustering if at most k outliers can be removed. 

As mentioned in the introduction, our work can be seen as a stab at the more 
general problem of clustering under instance-level hard constraints. Although arbitrary 
CL (cannot-link) constraints seem hard to approximate with respect to minimizing the 
number of clusters due to the hardness of graph coloring [10], other objectives and 
special classes of constraints, e.g. diversity constraints, may still admit good 
approximations. Besides the basic ML and CL constraints, we could consider more complex 
constraints like the rules proposed in the Dedupalog project [4]. One example of such 
rules says that whenever we cluster two points a and b together, we must also cluster c 
and d. Much less is known for incorporating these types of constraints into traditional 
clustering problems and we expect it to be an interesting and rich further direction. 